Multi-term search result with unsupervised query segmentation method and apparatus

ABSTRACT

Generally, a method and apparatus provide search results in response to a web search request having at least two search terms. The method and apparatus include generating a plurality of term groupings of the search terms and determining a relevance factor for each of the term groupings. The method and apparatus further determine a set of the term groupings based on the relevance factors and then conduct a web resource search using the set of term groupings to generate search results. The method and apparatus provide the search results to the requesting entity.

COPYRIGHT NOTICE

A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.

FIELD OF THE INVENTION

The present invention relates generally to Internet-based searching and more specifically to improving search result accuracy in response to search requests having two or more search terms.

Existing web-based search systems have difficulty handling search requests with numerous search terms. As used herein, numerous search terms refers to two or more search terms. This is commonly found when searching is done based on a phrase, such as entering a long search string, a popular title, or a song lyric, for example.

Using specific language to better exemplify the existing solutions, suppose a search request is entered having the following search terms: “simmons college sports psychology.” The search engine breaks this search request down in an attempt to decipher or otherwise estimate which terms are of highest importance for searching. For example, the search engine may have to decide between “simmons college,” “sports psychology,” and “college sports.”

A first approach is a mutual-information-based approach. This approach determines correlations between adjacent terms. It is also commonly known as the Units Web Service.

In natural language processing, there has been a significant amount of research on text segmentation, such as noun phrase chunking, where the task is to recognize the chunks that consist of noun phrases, and Chinese word segmentation, where the task is to delimit words by putting boundaries between Chinese characters. Query segmentation is similar to these problems in the sense that they all try to identify meaningful semantic units from the input. However, one may not be able to apply these techniques directly to query segmentation, because Web search query language is very different (queries tend to be short and composed of keywords), and some techniques essential to noun phrase chunking, such as part-of-speech tagging, cannot achieve high performance when applied to queries. Thus, detecting noun phrases for information retrieval has mainly been studied in document indexing and has not been addressed for search queries.

A second approach is a supervised learning approach. This approach applies a binary decision at each possible segmentation point, where the segmentation points are the boundaries between the various terms. This approach has a limited range of context and is specifically designed for noun phrases. Furthermore, due to the supervised learning aspect, this approach requires significant overhead for users to conduct the supervised learning.

In terms of unsupervised methods for text segmentation, the expectation maximization (EM) algorithm has been used for Chinese word segmentation and phoneme discovery, where a standard EM algorithm is applied to the whole corpus or collection of web resources. However, running the EM algorithm over the whole corpus is very expensive.

As such, there exists a need for a search query technique that processes and improves the search results for Internet-based searching operations using multi-term search requests.

SUMMARY OF THE INVENTION

Generally, a method and apparatus provide search results in response to a web search request having at least two search terms. The method and apparatus include generating a plurality of term groupings of the search terms and determining a relevance factor for each of the term groupings. The method and apparatus further determine a set of the term groupings based on the relevance factors and then conduct a web resource search using the set of term groupings to generate search results. The method and apparatus provide the search results to the requesting entity.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention is illustrated in the figures of the accompanying drawings, which are meant to be exemplary and not limiting, in which like references are intended to refer to like or corresponding parts, and in which:

FIG. 1 illustrates a block diagram of one embodiment of a processing system that includes an apparatus for providing search results in response to a search request having at least two search terms in the search request;

FIG. 2 illustrates a flowchart of the steps of one embodiment of a method for providing search results in response to a search request having at least two search terms in the search request;

FIG. 3 illustrates a graphical representation of one embodiment of an exemplary unigram model usable for determining relevance factors;

FIG. 4 illustrates a graphical representation of the generation of search terms and relevance computation;

FIG. 5 illustrates a graphical representation of another embodiment of the generation of search terms and relevance computation; and

FIG. 6 illustrates a graphical representation of another embodiment of the generation of search terms and relevance computation.

DETAILED DESCRIPTION OF THE INVENTION

In the following description of the embodiments of the invention, reference is made to the accompanying drawings that form a part hereof, and in which is shown by way of illustration exemplary embodiments in which the invention may be practiced. It is to be understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the present invention.

FIG. 1 illustrates a system 100 that includes a search engine server 102 in communication with a plurality of web resource databases 104, a multi-term search processing device 106 and a storage device 108 having executable instructions 110 stored therein. Further in the system are a network connection 112, a user 114 and a user's computer 116.

The server 102 may be any suitable type of search engine server, including any number of possible servers accessible via the network 112 using any suitable connectivity. The storage device 104 may be any suitable type of storage device in any number of locations accessible by the server 102. The storage device 104 includes web resource information as used by existing web search engines and web searching techniques.

The processing device 106 may be one or more processing devices operative to perform processing operations in response to executable instructions 110 received from the storage device 108. The storage device 108 may be any suitable storage device operative to store the executable instructions thereon.

It is further noted that various additional components, as recognized by one skilled in the art, have been omitted from the block diagram of the system 100 for brevity purposes only. Similarly, for brevity's sake, the operation of the processing system 100, and specifically the processing device 106, is described in conjunction with the flowchart of FIG. 2.

FIG. 2 illustrates steps of a method for providing search results. In a typical embodiment, the user 114 enters a web-based search request on the computer 116. The computer 116 may provide an interactive display of a web page from the web server 102, via the Internet 112. It is also noted that the network 112 is generally referred to as the Internet, but may be any suitable network (e.g., public and/or private), as recognized by one of ordinary skill in the art.

Prior to the method of FIG. 2, a user may submit the search request with search terms on the web search portal. The submitted search request includes numerous search terms, including at least two search terms. As an example, the search request may be a string of four words, e.g., “simmons college sports psychology.” Thereby, in this embodiment of the method, the first step, step 120, is generating a plurality of term groupings of the search terms in the search request. This grouping includes denoting the possible variations of the terms. In the example above, the groupings may include “simmons college,” “simmons sports,” “simmons psychology,” “college sports,” “college psychology,” and “sports psychology.” This step may be performed by the processing device 106 in response to the executable instructions 110 from the storage device 108 of FIG. 1.
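For illustration only, the grouping of step 120 may be sketched in a few lines of Python. The function name generate_term_groupings and the pairwise grouping strategy are assumptions made for the sketch, not a limitation of the embodiment; the sketch simply reproduces the two-term groupings listed above.

    from itertools import combinations

    def generate_term_groupings(query):
        """Produce candidate two-term groupings, preserving query order.

        A minimal sketch: every ordered pair of distinct terms is
        proposed as a candidate grouping, as in the
        "simmons college sports psychology" example above.
        """
        terms = query.split()
        return [" ".join(pair) for pair in combinations(terms, 2)]

    # generate_term_groupings("simmons college sports psychology")
    # -> ['simmons college', 'simmons sports', 'simmons psychology',
    #     'college sports', 'college psychology', 'sports psychology']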

In this embodiment, a next step, step 122, is determining a relevance factor for each of the term groupings. As described in further detail below, this relevance factor may be determined using a unigram model. This determination step may be performed by the processing device 106 in response to the executable instructions 110 from the storage device 108 of FIG. 1.

Once relevance factors are determined, a next step, step 124, is determining a set of the term groupings based on the relevance factors. The set includes the term groupings that are determined to be most relevant based on the relevance factors. In one embodiment, as described below, relevancy includes the term groupings with the highest relevance score. By way of example, and for illustration purposes only, this may include determining the set to be the groupings “simmons college” and “sports psychology” from the above example search request. This determination step may be performed by the processing device 106 in response to the executable instructions 110 from the storage device 108 of FIG. 1.

FIG. 3 illustrates a graphical representation of an exemplary unigram model for the sample search term “simmons college sports psychology.” The illustrated unigram model includes probability calculations for the independent sampling from a probability distribution of concepts. For example, the probability distribution is calculated for P(simmons college) and P(sports psychology). This probability distribution is then compared to the probability distribution of P(simmons), P(college sports) and P(psychology).

A next step, step 126, is conducting a web resource search using the set of term groupings to generate search results. The web search may be done by the server 102 in accordance with known searching techniques using the set of term groupings. In another embodiment, as described in further detail below, the searching may be done based on a web corpus. The web corpus provides a reduced number of resources to be searched, hence improving search speed and reducing the processing overhead that multi-term searches incur against full search data loads.

In this embodiment, once the search results have been collected, a final step is then providing the search results to a requesting entity, step 128. In the embodiment of FIG. 1, this may include generating a search results page on the web server 102 and providing the search results page to the computer 116 via the Internet 112, whereby the user 114 can then view the search results. In accordance with known search result techniques, the results may be active hyperlinks to the specific resources themselves or to cached versions of the resources such that, upon the user's selection, the computer 116 may then access the corresponding web resource via the Internet 112.

As described in further detail below, the search may further include unsupervised learning regarding term groupings. This unsupervised learning may include accessing automated name grouping resources, where these resources provide direction regarding name groupings. By reference to these resources, a higher degree of accuracy may be achieved regarding the sequencing of search terms, and because this access is unsupervised, it reduces the computation overhead associated with the manual activity required by prior name grouping techniques.

By way of example, an automated name grouping resource may include a name entity recognizer, an online user-generated content data resource, a noun phrase model or any other suitable resource. The name entity recognizer produces entities such as businesses and locations, and the system may match a proposed segmentation against the name entity recognition results. The online content data may be a recognized source, such as, for example, the encyclopedia at Wikipedia.com, which is a human-edited repository that provides recognizable term groupings, also used by comparison. The noun phrase model computes the probability that a segment is a noun phrase.
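As one illustration, the comparison against an automated name grouping resource can be as simple as checking whether a proposed segment appears in the resource. The fragment below is a hypothetical sketch; named_entities stands in for the output of a name entity recognizer or a list of titles from a human-edited repository.

    def score_against_resource(segments, named_entities):
        """Count how many proposed segments are confirmed by a name
        grouping resource (e.g., name entity recognizer output or a
        human-edited title list). Hypothetical helper for illustration.
        """
        known = {entity.lower() for entity in named_entities}
        return sum(1 for s in segments if s.lower() in known)

    # score_against_resource(["simmons college", "sports psychology"],
    #                        ["Simmons College"])  # -> 1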

It is when the query is uttered (e.g., typed into a search box) that the concepts are “serialized” into a sequence of words, with their boundaries dissolved. The task of query segmentation, as described herein, is to recover the boundaries that separate the concepts.

Given that the basic units in query generation are concepts, an assumption can be made that they are independent and identically distributed (I.I.D.). In other words, there is a probability distribution P_C of concepts, which is sampled repeatedly to produce mutually independent concepts that construct a query. This may be regarded as a unigram language model, with a gram being not a word but a concept/segment.

The above I.I.D. assumption carries several limitations. First, concepts are not really independent of each other. For example, it is more likely to observe “travel guide” after “new york” than “new york times”. Second, the probability of a concept may vary by its position in the text. For example, we expect to see “travel guide” more often at the end of a query than at the beginning. While this problem can be addressed by using a higher-order model (e.g., the bigram model) and adding a position variable, this would dramatically increase the number of parameters needed to describe the model. Thus, for simplicity, the unigram model is used, and it proves to work reasonably well for the query segmentation task.

Let $T = w_1 w_2 \ldots w_n$ be a piece of text of $n$ words, and $S^T = s_1 s_2 \ldots s_m$ be a possible segmentation consisting of $m$ segments, where $s_i = w_{k_i} w_{k_i+1} \ldots w_{k_{i+1}-1}$ and $1 = k_1 < k_2 < \cdots < k_{m+1} = n+1$.

For a given query Q, if it is produced by the above generative language model, with concepts repeatedly sampled from distribution P_C until the desired query is obtained, then the probability of it being generated according to an underlying sequence of concepts (i.e., a segmentation of the query) S^Q is:

$P(S^Q) = P(s_1)\,P(s_2 \mid s_1) \cdots P(s_m \mid s_1 s_2 \ldots s_{m-1})$  (Equation 1)

The unigram model provides:

$P(s_i \mid s_1 s_2 \ldots s_{i-1}) = P_C(s_i)$  (Equation 2)

Based on Equation 1 in combination with Equation 2, this produces:

$\begin{matrix}{{P\left( S^{Q} \right)} = {\prod\limits_{s_{i} \in S^{Q}}^{\;}\; {P_{C}\left( s_{i} \right)}}} & {{Equation}\mspace{14mu} 3}\end{matrix}$

From this, the cumulative probability of generating Q is:

$\begin{matrix}{{P(Q)} = {\sum\limits_{S^{Q}}{P\left( S^{Q} \right)}}} & {{Equation}\mspace{14mu} 4}\end{matrix}$

In Equation 4, S^Q is one of the 2^(n−1) different segmentations, with n being the number of query words.
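Equations 3 and 4 can be made concrete with a brute-force sketch. Assuming a dictionary P_C mapping concepts to probabilities (the names below are illustrative only), the fragment enumerates all 2^(n−1) segmentations of a query and sums their probabilities; it is practical only for short queries.

    from itertools import product

    def segmentations(words):
        """Yield every segmentation of a word list as a list of segments."""
        n = len(words)
        for cuts in product([0, 1], repeat=n - 1):   # 2^(n-1) boundary choices
            segs, start = [], 0
            for i, cut in enumerate(cuts, start=1):
                if cut:
                    segs.append(" ".join(words[start:i]))
                    start = i
            segs.append(" ".join(words[start:]))
            yield segs

    def seg_probability(segs, P_C):
        """Equation 3: product of concept probabilities, 0 if any segment is unknown."""
        p = 1.0
        for s in segs:
            p *= P_C.get(s, 0.0)
        return p

    def query_probability(query, P_C):
        """Equation 4: sum of Equation 3 over all segmentations."""
        words = query.split()
        return sum(seg_probability(s, P_C) for s in segmentations(words))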

For two segmentations $S_1^T$ and $S_2^T$ of the same piece of text $T$, suppose they differ at only one segment boundary, i.e., $S_1^T = s_1 s_2 \ldots s_{k-1} s_k s_{k+1} s_{k+2} \ldots s_m$ and $S_2^T = s_1 s_2 \ldots s_{k-1} s'_k s_{k+2} \ldots s_m$, where $s'_k = (s_k s_{k+1})$ is the concatenation of $s_k$ and $s_{k+1}$.

One embodiment favors segmentations with a higher probability of generating the query. In the above case, $P(S_1^T) > P(S_2^T)$ if and only if $P_C(s_k)\,P_C(s_{k+1}) > P_C(s'_k)$, i.e., when $s_k$ and $s_{k+1}$ are negatively correlated. In other words, a segment boundary is justified if and only if the pointwise mutual information between the two segments resulting from the split is negative:

$\begin{matrix}{{{MI}\left( {s_{k},s_{k + 1}} \right)} = {{\log \; \frac{P_{c}\left( s_{k}^{\prime} \right)}{{P_{c}\left( s_{k} \right)}{P_{c}\left( s_{k + 1} \right)}}} < 0}} & {{Equation}\mspace{14mu} 5}\end{matrix}$

Note that this differs from the known MI-based approach in that the mutual information computed above is between adjacent segments, rather than between words. More importantly, the segmentation decision is non-local (i.e., it involves a context beyond the words near the segment boundary of concern): whether s_k and s_{k+1} should be joined or split depends on the positions of s_k's left boundary and s_{k+1}'s right boundary, which in turn involve other segmentation decisions.
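The boundary test of Equation 5 can be expressed directly. The sketch below assumes P_C is a dictionary of concept probabilities and treats a missing concatenated segment as having zero probability, in which case the split is kept; the function name is illustrative.

    import math

    def keep_boundary(s_k, s_k1, P_C):
        """Equation 5: keep the boundary between s_k and s_{k+1} when
        their pointwise mutual information is negative."""
        joined = P_C.get(s_k + " " + s_k1, 0.0)
        split = P_C.get(s_k, 0.0) * P_C.get(s_k1, 0.0)
        if joined == 0.0:
            return True       # concatenation never seen: keep the split
        if split == 0.0:
            return False      # parts never seen alone: join them
        return math.log(joined / split) < 0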

In enumerating all possible segmentations, the “best” segmentation will be the one with the highest likelihood of generating the query, in this embodiment. The segmentations can also be ranked by likelihood to output the top k.

In practice, enumerating segmentations is infeasible except for short queries, as the number of possible segmentations grows exponentially with query length. However, the I.I.D. nature of the unigram model makes it possible to use dynamic programming to compute the top k best segmentations. An exemplary algorithm is included in Appendix I. The complexity is O(n k m log(k m)), where n is the query length and m is the maximum allowed segment length.
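A runnable version of the dynamic program of Appendix I might look as follows. This is a sketch under stated assumptions, not the claimed implementation: P_C is a dictionary of concept probabilities, k is the number of segmentations to keep, and m is the maximum segment length.

    def top_k_segmentations(words, P_C, k=3, m=5):
        """Dynamic program after Appendix I: B[i] holds the top-k scored
        segmentations of the prefix w_1..w_i as (probability, segments)."""
        n = len(words)
        B = [[] for _ in range(n + 1)]
        B[0] = [(1.0, [])]                       # empty prefix
        for i in range(1, n + 1):
            candidates = []
            for j in range(max(0, i - m), i):    # last segment is w_{j+1}..w_i
                seg = " ".join(words[j:i])
                p_seg = P_C.get(seg, 0.0)
                if p_seg > 0.0:
                    for p, segs in B[j]:
                        candidates.append((p * p_seg, segs + [seg]))
            candidates.sort(key=lambda c: c[0], reverse=True)
            B[i] = candidates[:k]
        return B[n]

    # top_k_segmentations("new york times".split(),
    #                     {"new york times": 0.4, "new york": 0.3, "times": 0.2})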

One aspect to be addressed in providing search results in response to multi-term search requests is how to determine the parameters of the unigram language model, i.e., the probabilities of the concepts, which take the form of variable-length n-grams. One embodiment includes unsupervised learning; therefore it is desirable to estimate the parameters automatically from the provided textual data.

In one embodiment, a source of data that can be used is a text corpus consisting of a small percentage sample of the web pages crawled by a search engine, such as the Yahoo! search engine, for example. We count the frequency of all possible n-grams up to a certain length (n = 1, 2, . . . , 5) that occur at least once in the corpus. It is usually impractical to do this for longer n-grams, as their number grows exponentially with n, posing difficulties for storage space and access time. However, for long n-grams (n > 5) that are also frequent in the corpus, it is often possible to approximate their counts using those of shorter n-grams.

The processing operation computes lower bounds of long n-gram counts using set inequalities, and takes them as approximations to the real counts. For example, the frequency for “harry potter and the goblet of fire” can be determined to lie in the reasonably narrow range of [5783, 6399], using 5783 as an estimate for its true frequency.

If we have frequencies of occurrence in a text corpus for all n-grams up to a given length, then we can infer lower bounds of frequencies for longer n-grams, whose real frequencies are unknown. The bound is a lower bound in the sense that any smaller number would cause contradictions with the known frequencies.

Let #(x) denote n-gram x's frequency. Let A, B, C be arbitrary n-grams, and AB, BC, ABC be their concatenations. Let #(AB ∨ BC) denote the number of times B follows A or is followed by C in the corpus. This generates:

$\#(ABC) = \#(AB) + \#(BC) - \#(AB \vee BC)$  (Equation 6)

$\#(ABC) \geq \#(AB) + \#(BC) - \#(B)$  (Equation 7)

Equation 6 follows directly from a basic equation on set cardinality, |X∩Y| = |X| + |Y| − |X∪Y|, where X is the set of occurrences of B in which B follows A and Y is the set of occurrences of B in which B is followed by C.

Since #(B) ≥ #(AB ∨ BC), Equation 7 holds.

Therefore, for any n-gram $x = w_1 w_2 \ldots w_n$ ($n \geq 3$), if the routine defines:

$\begin{matrix}{{f_{i,j}(x)}\overset{def}{-}{\# \left( {w_{1}\mspace{14mu} \ldots \mspace{14mu} w_{j}} \right)} + {\# \left( {w_{i}\mspace{14mu} \ldots \mspace{14mu} w_{n}} \right)} - {\# \left( {w_{i}\mspace{14mu} \ldots \mspace{14mu} w_{j}} \right)}} & {{Equation}\mspace{14mu} 8}\end{matrix}$

This generates Equation 9:

$\begin{matrix}{{\# (x)} \geq {\max\limits_{1 < i < j < n}{f_{i,j}(x)}}} & {{Equation}\mspace{14mu} 9}\end{matrix}$

Equation 9 allows for the computation of the frequency lower bound for x using frequencies for sub-n-grams of x, i.e., compute a lower bound for all possible pairs of (i, j), and choose their maximum. In case #(w₁ … w_j) or #(w_i … w_n) is unknown, their lower bounds, which are obtained in a recursive manner, can be used instead. Note that what is obtained is not necessarily the greatest lower bound, if all possible frequency constraints are to be taken into account. Rather, these are best-effort estimates using the above set inequalities.
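Equations 8 and 9 translate into a short recursive computation. The sketch below is illustrative only: it assumes a dictionary freq of counted n-gram frequencies keyed by tuples of words, returns a best-effort lower bound for an uncounted n-gram, and does not apply the pruning optimization described below.

    def count_lower_bound(x, freq, cache=None):
        """Best-effort lower bound on #(x) via Equation 9:
        #(x) >= max over (i, j) of #(w_1..w_j) + #(w_i..w_n) - #(w_i..w_j)."""
        if cache is None:
            cache = {}
        if x in freq:
            return freq[x]
        if x in cache:
            return cache[x]
        n = len(x)
        best = 0
        for i in range(1, n - 1):          # 0-based start of the suffix piece
            for j in range(i + 1, n):      # 0-based end (exclusive) of the prefix piece
                prefix = count_lower_bound(x[:j], freq, cache)
                suffix = count_lower_bound(x[i:], freq, cache)
                overlap = count_lower_bound(x[i:j], freq, cache)
                best = max(best, prefix + suffix - overlap)
        cache[x] = best
        return best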

In reality, not all (i, j) pairs need to be enumerated: if i ≤ i′ < j′ ≤ j, then:

$f_{i,j}(x) \geq f_{i',j'}(x)$  (Equation 10)

because:

$\begin{matrix}\left( {{\# \left( {i,j} \right)}\overset{def}{-}{\# \left( {w_{i}w_{i + 1}\mspace{14mu} \ldots \mspace{14mu} w_{j}} \right)}} \right) & {{Equation}\mspace{14mu} 11}\end{matrix}$

Equation 10 follows, in part, from the inequalities used in Equation 7, with #(i, j) defined as in Equation 11.

Equation 10 indicates that there is no need to consider f_{i′,j′}(x) in the computation of Equation 9 if there is a sub-n-gram w_i … w_j longer than w_{i′} … w_{j′} with known frequency. This can save a lot of computation.

A second algorithm, as described in Appendix II, gives the frequency lower bounds for all n-grams in a given query, with complexity O(n²m), where m is the maximum length of n-grams whose frequencies have been counted.

Suppose the entire text corpus has already been segmented into concepts in a preprocessing step. The methodology can then use Equation 12, so that the frequency of an n-gram is the number of times it appears in the corpus as a whole segment. For example, in a correctly segmented corpus, there will be very few “york times” segments (most “york times” occurrences will be within “new york times” segments), resulting in a small value of P_C(york times), which makes sense. However, having people manually segment the documents is only feasible on small datasets; on a large corpus it would be too costly.

$\begin{matrix}{{P_{C}(x)} = \frac{\# (x)}{\sum_{x^{\prime} \in V}{\# \left( x^{\prime} \right)}}} & {{Equation}\mspace{14mu} 12}\end{matrix}$

An alternative is unsupervised learning, which does not need human-labeled segmented data but instead uses a large amount of unsegmented data to learn a segmentation model. Expectation maximization (EM) is an optimization method that is commonly used in unsupervised learning, and it has already been applied to text segmentation. In the EM algorithm, in the expectation step the unsegmented data is automatically segmented using the current set of estimated parameter values, and in the maximization step a new set of parameter values is calculated to maximize the complete likelihood of the data, which is augmented with the segmentation information. The two steps alternate until a termination condition is reached (e.g., convergence).

The major difficulty is that, when the corpus size is very large (for example, 1% of the crawled web), it will still be too expensive to run these algorithms, which usually require many passes over the corpus and very large data storage to remember all extracted patterns.

To avoid running the EM algorithm over the whole corpus, one embodiment includes running the EM algorithm only on a partial corpus that is specific to a query. More specifically, when a new query arrives, we extract the parts of the corpus that overlap with it (we call this the query-relevant partial corpus), which are then segmented into concepts, so that probabilities for n-grams in the query can be computed. All parts unrelated to the query of concern are disregarded; thus, the computation cost is dramatically reduced.

We can construct the query-relevant partial corpus with the following procedure. First, we locate all words in the corpus that appear in the query. We then join these words into longer n-grams if the words are adjacent to each other in the corpus, so that the resulting n-grams become the longest matches with the query. For example, for the query “new york times subscription”, if the corpus contains “new york times” somewhere, then the longest match at that position is “new york times”, not “new york” or “york times”. This longest-match requirement is effective against incomplete concepts, which are a problem for the raw frequency approach as previously mentioned. Note that there is no segmentation information associated with the longest matches; the algorithm has no obligation to keep the longest matches as complete segments. For example, it can split “new york times” in the above case into “new york” and “times” if the corpus statistics make it more reasonable to do so. However, there are still two artificial segment boundaries created at each end of a longest match (which means, e.g., “times” cannot associate with the word “square” following it in the corpus but not included in the query).
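The longest-match idea can be pictured as a scan over the corpus word stream: at each position, take the longest n-gram that is both part of the query and present in the corpus, and increment its count. The fragment below is illustrative only and works on in-memory token lists; a production system would compute the same counts from precomputed n-gram frequencies, as in Algorithm 3 described later.

    def query_ngrams(query_words, max_len):
        """All contiguous n-grams of the query up to max_len, as tuples."""
        out = set()
        for i in range(len(query_words)):
            for j in range(i + 1, min(i + max_len, len(query_words)) + 1):
                out.add(tuple(query_words[i:j]))
        return out

    def longest_match_counts(corpus_words, query_words, max_len=5):
        """Greedy scan of the corpus counting longest matches with the query."""
        targets = query_ngrams(query_words, max_len)
        counts = {}
        pos = 0
        while pos < len(corpus_words):
            best = 0
            for length in range(min(max_len, len(corpus_words) - pos), 0, -1):
                if tuple(corpus_words[pos:pos + length]) in targets:
                    best = length
                    break
            if best:
                gram = tuple(corpus_words[pos:pos + best])
                counts[gram] = counts.get(gram, 0) + 1
                pos += best
            else:
                pos += 1
        return counts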

Because all non-query words are disregarded, there is no need to keep track of the matching positions in the corpus. Therefore, the query-relevant partial corpus can be represented as a list of n-grams from the query, associated with their longest match counts, as denoted by Equation 13.

$D = \{(x, c(x)) \mid x \in Q\}$  (Equation 13)

In Equation 13, x is an n-gram in query Q, and c(x) is its longest match count.

The partial corpus represents the frequency information that is most directly related to the current query. We can think of it as a distilled version of the original corpus, in the form of a concatenation of all n-grams from the query, each repeated a number of times equal to its longest match count, with all other words in the corpus substituted by a wildcard, as denoted by Equation 14:

$\underbrace{x_1 x_1 \ldots x_1}_{c(x_1)} \; \underbrace{x_2 x_2 \ldots x_2}_{c(x_2)} \; \ldots \; \underbrace{x_k x_k \ldots x_k}_{c(x_k)} \; \underbrace{w\, w \ldots w}_{N - \sum_i c(x_i)\,|x_i|}$  (Equation 14)

In Equation 14, x₁, x₂, …, x_k are all of the n-grams in the query, w is a wildcard word representing words not present in the query, and N is the corpus length. We denote n-gram x's size by |x|, so N − Σ_i c(x_i)|x_i| is the length of the non-overlapping part of the corpus.

Practically, the longest match counts can be computed efficiently from the raw frequencies, which are either counted or approximated using lower bounds.

Given query Q, let x be an n-gram in Q, L(x) be the set of words that precede x in Q, and R(x) be the set of words that follow x in Q. For example, if Q is “new york times new subscription”, and x is “new”, then L(x) = {times} and R(x) = {york, subscription}.

The longest match count for x is essentially the number of occurrences of x in the corpus not preceded by any word from L(x) and not followed by any word from R(x), which we denote as a.

Let b be the total number of occurrences of x, i.e., #(x).

Let c be the number of occurrences of x preceded by any word from L(x).

Let d be the number of occurrences of x followed by any word from R(x).

Let e be the number of occurrences of x preceded by any word from L(x) and at the same time followed by any word from R(x). Then it is easy to see that a = b − c − d + e.
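Given raw n-gram frequencies, this inclusion-exclusion step is direct. The sketch below mirrors the computation of Appendix III; ngram_count is a hypothetical lookup that returns a counted frequency or its lower bound.

    def longest_match_count(x, L, R, ngram_count):
        """a = b - c - d + e, per the inclusion-exclusion argument above.
        x is an n-gram (tuple of words); L and R are the sets of query
        words preceding / following x in the query."""
        b = ngram_count(x)
        c = sum(ngram_count((l,) + x) for l in L)
        d = sum(ngram_count(x + (r,)) for r in R)
        e = sum(ngram_count((l,) + x + (r,)) for l in L for r in R)
        return b - c - d + e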

Algorithm 3, noted in Appendix III, computes the longest match count. Its complexity is O(l²), where l is the query length.

If we treat the query-relevant partial corpus D as a source of textual evidence, we can use maximum a posteriori (MAP) estimation, choosing parameters θ (the set of concept probabilities) to maximize the posterior likelihood given the observed evidence, as illustrated in Equation 15.

$\theta = \arg\max_{\theta} P(D \mid \theta)\, P(\theta)$  (Equation 15)

In Equation 15, P(θ) is the prior likelihood of θ. Equation 15 can also be rewritten as Equation 16.

$\theta = \arg\min_{\theta} \left( -\log P(D \mid \theta) - \log P(\theta) \right)$  (Equation 16)

In Equation 16, −log P(D|θ) is the description length of the corpus, and −log P(θ) is the description length of the parameters. The first part prefers parameters that are more likely to generate the evidence, while the second part disfavors parameters that are complex to describe. The goal is to reach a balance between the two by minimizing the combined description length.

For the corpus description length, Equation 17 provides the following calculation according to the distilled corpus representation of Equation 14.

$\begin{matrix}{{\log \; {P\left( D \middle| \theta \right)}} = {{\sum\limits_{x \in Q}^{\;}{\log \; {{P\left( x \middle| \theta \right)} \cdot {c(x)}}}} + {{\log\left( {1 - {\sum\limits_{x \in Q}^{\;}{P\left( x \middle| \theta \right)}}} \right)} \cdot \left( {N - {\sum\limits_{x \in Q}^{\;}{{c(x)}{x}}}} \right)}}} & {{Equation}\mspace{14mu} 17}\end{matrix}$

In Equation 17, x is an n-gram in query Q, c(x) is its longest match count, |x| is the n-gram length, N is the corpus length, and P(x|θ) is the probability of the parameterized concept distribution generating x as a piece of text. The second part of the equation is necessary, as it keeps the probability sum for the n-grams in the query in proportion to the partial corpus size.
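Equation 17 can be written out as a small function. In the sketch below, P_x maps each query n-gram x to P(x|θ), c maps it to its longest match count, length maps it to |x|, and N is the corpus length; all names are assumptions for illustration, and the probabilities are assumed positive with a sum below one.

    import math

    def corpus_log_likelihood(P_x, c, length, N):
        """Equation 17: log P(D | theta) for the distilled partial corpus."""
        ll = sum(math.log(P_x[x]) * c[x] for x in P_x)
        remainder = N - sum(c[x] * length[x] for x in P_x)
        ll += math.log(1.0 - sum(P_x.values())) * remainder
        return ll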

The probability of text x being generated can be summed over all of its possible segmentations, as shown by Equation 18.

$\begin{matrix}{{P\left( x \middle| \theta \right)} = {\sum\limits_{S^{x}}{P\left( S^{x} \middle| \theta \right)}}} & {{Equation}\mspace{14mu} 18}\end{matrix}$

In Equation 18, S^x is a segmentation of n-gram x. Note that the S^x are hidden variables in the optimization problem.

For the description length of the prior parameters θ, it is computed as noted in Equation 19.

$\begin{matrix}{{\log \; {P(\theta)}} = {\alpha {\sum\limits_{x \in \theta}{\log \mspace{11mu} {P\left( x \middle| \theta \right)}}}}} & {{Equation}\mspace{14mu} 19}\end{matrix}$

In Equation 19, α is a predefined weight, x ∈ θ means the concept distribution has a non-zero probability for x, and P(x|θ) is computed as above. This is equivalent to adding α to the longest match counts for all n-grams in the lexicon θ. Thus, the inclusion of long yet infrequent n-grams in the lexicon is penalized for the resulting increase in parameter description length.

To estimate the n-gram probabilities with the above minimum description length set-up, one technique is to use variant Baum-Welch algorithms as known in the art. The variant Baum-Welch algorithms are also followed to delete from the lexicon all n-grams that reduce the total description length when deleted. The complexity of the algorithm is O(kl), where k is the number of different n-grams in the partial corpus, and l is the number of deletion phases. In practice, the above EM algorithm converges quickly and can be run without the user's awareness.
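For orientation only, a compact re-estimation loop on the query-relevant partial corpus might look as follows. This sketch performs a plain EM update of concept probabilities from expected segment counts; it does not implement the variant Baum-Welch with description-length-based deletion described above, and the names are assumptions.

    from itertools import product

    def _segmentations(words):
        """All segmentations of a word tuple, as lists of space-joined segments."""
        for cuts in product([0, 1], repeat=len(words) - 1):
            segs, start = [], 0
            for i, cut in enumerate(cuts, start=1):
                if cut:
                    segs.append(" ".join(words[start:i]))
                    start = i
            segs.append(" ".join(words[start:]))
            yield segs

    def em_concept_probs(partial_corpus, P_C, iterations=10):
        """Plain EM sketch: partial_corpus maps query n-grams (tuples of
        words) to longest match counts; P_C is an initial concept
        distribution (e.g., from raw frequencies)."""
        for _ in range(iterations):
            expected = {}
            for x, c_x in partial_corpus.items():
                scored = []
                for segs in _segmentations(x):
                    p = 1.0
                    for s in segs:
                        p *= P_C.get(s, 0.0)
                    if p > 0.0:
                        scored.append((p, segs))
                total = sum(p for p, _ in scored)
                if total == 0.0:
                    continue
                for p, segs in scored:        # E-step: expected segment counts
                    for s in segs:
                        expected[s] = expected.get(s, 0.0) + c_x * p / total
            norm = sum(expected.values())     # M-step: renormalize
            if norm > 0.0:
                P_C = {s: v / norm for s, v in expected.items()}
        return P_C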

For further description, FIGS. 4-6 illustrate parameter estimation solutions that may be included in the performance of the method and the operations of the apparatus performing the method. FIG. 4 illustrates a possible parameter estimation solution for offline segmentation of the web corpus, followed by collecting counts for the n-grams that appear as segments. For example, this search includes a sample web resource for a search term, such as the book title “Harry Potter and the Goblet of Fire.” In this resource, it is noted that the full “harry potter and the goblet of fire” string is found, as indicated by the +1 designation, while “potter and the goblet of” is not found as a separate segment outside of the full descriptive string noted above, hence the +0 designation.

FIGS. 5 and 6 illustrate another parameter estimation solution. This solution includes an online computation where the methodology considers only the parts of the web corpus overlapping with the query, i.e., the longest matches with the query. As described above, this technique includes generating the web corpus first and performing the analysis on this web corpus, thereby reducing the processing overhead and processing time. In FIG. 5, the query is “harry potter and the goblet of fire” and in FIG. 6, the query is “potter and the goblet.” From these query sets, the parameter estimations may be performed consistent with the computations described above.

FIGS. 1 through 6 are conceptual illustrations allowing for an explanation of the present invention. It should be understood that various aspects of the embodiments of the present invention could be implemented in hardware, firmware, software, or combinations thereof. In such embodiments, the various components and/or steps would be implemented in hardware, firmware, and/or software to perform the functions of the present invention. That is, the same piece of hardware, firmware, or module of software could perform one or more of the illustrated blocks (e.g., components or steps).

In software implementations, computer software (e.g., programs or other instructions) and/or data is stored on a machine readable medium as part of a computer program product, and is loaded into a computer system or other device or machine via a removable storage drive, hard drive, or communications interface. Computer programs (also called computer control logic or computer readable program code) are stored in a main and/or secondary memory, and executed by one or more processors (controllers, or the like) to cause the one or more processors to perform the functions of the invention as described herein. In this document, the terms memory and/or storage device may be used to generally refer to media such as a random access memory (RAM); a read only memory (ROM); a removable storage unit (e.g., a magnetic or optical disc, flash memory device, or the like); a hard disk; electronic, electromagnetic, optical, acoustical, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.); or the like.

Notably, the figures and examples above are not meant to limit the scope of the present invention to a single embodiment, as other embodiments are possible by way of interchange of some or all of the described or illustrated elements. Moreover, where certain elements of the present invention can be partially or fully implemented using known components, only those portions of such known components that are necessary for an understanding of the present invention are described, and detailed descriptions of other portions of such known components are omitted so as not to obscure the invention. In the present specification, an embodiment showing a singular component should not necessarily be limited to other embodiments including a plurality of the same component, and vice-versa, unless explicitly stated otherwise herein. Moreover, applicants do not intend for any term in the specification or claims to be ascribed an uncommon or special meaning unless explicitly set forth as such. Further, the present invention encompasses present and future known equivalents to the known components referred to herein by way of illustration.

The foregoing description of the specific embodiments so fully reveals the general nature of the invention that others can, by applying knowledge within the skill of the relevant art(s) (including the contents of the documents cited and incorporated by reference herein), readily modify and/or adapt such specific embodiments for various applications, without undue experimentation and without departing from the general concept of the present invention. Such adaptations and modifications are therefore intended to be within the meaning and range of equivalents of the disclosed embodiments, based on the teaching and guidance presented herein. It is to be understood that the phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the present specification is to be interpreted by the skilled artisan in light of the teachings and guidance presented herein, in combination with the knowledge of one skilled in the relevant art(s).

While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example, and not limitation. It would be apparent to one skilled in the relevant art(s) that various changes in form and detail could be made therein without departing from the spirit and scope of the invention. Thus, the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

APPENDIX I

Input: query w₁w₂ ... w_n, concept probability distribution P_C
Output: top k segmentations with highest likelihood
B[i]: top k segmentations for sub-text w₁w₂ ... w_i. For each segmentation b ∈ B[i], segs denotes the segments and prob denotes the likelihood of the sub-text given this segmentation.

for i in [1..n]
    s ← w₁w₂ ... w_i
    if P_C(s) > 0
        a ← new segmentation
        a.segs ← {s}
        a.prob ← P_C(s)
        B[i] ← {a}
    for j in [1..i − 1]
        for b in B[j]
            s ← w_(j+1) ... w_i
            if P_C(s) > 0
                a ← new segmentation
                a.segs ← b.segs ∪ {s}
                a.prob ← b.prob × P_C(s)
                B[i] ← B[i] ∪ {a}
    sort B[i] by prob
    truncate B[i] to size k
return B[n]

APPENDIX II

Input: query w₁w₂ ... w_n, frequencies for all n-grams not longer than m
Output: frequencies (or their lower bounds) for all n-grams in the query
C[i, j]: frequency (or its lower bound) for n-gram w_i ... w_j

for l in [1..n]
    for i in [1..n − l + 1]
        j ← i + l − 1
        if #(w_i ... w_j) is known
            C[i, j] ← #(w_i ... w_j)
        else
            C[i, j] ← 0
            for k in [i + 1..j − m]
                C[i, j] ← max(C[i, j], C[i, k + m − 1] + C[k, j] − C[k, k + m − 1])
return C

APPENDIX III

Input: query Q, n-gram x, frequencies for all n-grams in Q
Output: longest match count for x

c(x) ← #(x)
for l ∈ L(x)
    c(x) ← c(x) − #(lx)
for r ∈ R(x)
    c(x) ← c(x) − #(xr)
for l ∈ L(x)
    for r ∈ R(x)
        c(x) ← c(x) + #(lxr)
return c(x)

1. A method for providing search results in response to a web search request having at least two search terms in the search request, the method comprising: generating a plurality of term groupings of the search terms; determining a relevance factor for each of the term groupings; determining a set of the term groupings based on the relevance factors; conducting a web resource search using the set of term groupings to generate search results; and providing the search results to a requesting entity.

2. The method of claim 1, wherein the generating the plurality of term groupings includes accessing an automated name grouping resource.

3. The method of claim 1, wherein the automated name grouping resource includes at least one of: a name entity recognizer, an online user-generated-content data resource and a noun phrase model.

4. The method of claim 1, wherein the grouping relevance is based on a ranking by probability of the grouping being generated by a unigram model.

5. The method of claim 4, wherein the probability is based on a maximum likelihood estimate.

6. The method of claim 1, further comprising: generating a web corpus overlapping with search results for the search request; and conducting the web resource search on the web corpus.

7. The method of claim 6, further comprising: adjusting the term groupings based on probabilities; and adjusting the web corpus based on the adjusted term groupings.

8. An apparatus for providing search results in response to a web search request having at least two search terms in the search request, the apparatus comprising: a computer-readable medium having executable instructions stored thereon; and a processing device, in response to the executable instructions, operative to: generate a plurality of term groupings of the search terms; determine a relevance factor for each of the term groupings; determine a set of the term groupings based on the relevance factors; conduct a web resource search using the set of term groupings to generate search results; and provide the search results to a requesting entity.

9. The apparatus of claim 8, wherein the generating the plurality of term groupings includes accessing an automated name grouping resource.

10. The apparatus of claim 8, wherein the automated name grouping resource includes at least one of: a name entity recognizer, an online user-generated-content data resource and a noun phrase model.

11. The apparatus of claim 8, wherein the grouping relevance is based on a ranking by probability of the grouping being generated by a unigram model.

12. The apparatus of claim 11, wherein the probability is based on a maximum likelihood estimate.

13. The apparatus of claim 8, wherein the processing device, in response to the executable instructions, is further operative to: generate a web corpus overlapping with search results for the search request; and conduct the web resource search on the web corpus.

14. The apparatus of claim 13, wherein the processing device, in response to the executable instructions, is further operative to: adjust the term groupings based on probabilities; and adjust the web corpus based on the adjusted term groupings.

15. A computer readable medium having executable instructions stored thereon such that, when read by a processing device, the executable instructions provide a method for providing search results in response to a web search request having at least two search terms in the search request, the method comprising: generating a plurality of term groupings of the search terms; determining a relevance factor for each of the term groupings; determining a set of the term groupings based on the relevance factors; conducting a web resource search using the set of term groupings to generate search results; and providing the search results to a requesting entity.

16. The computer readable medium of claim 15, wherein the generating the plurality of term groupings includes accessing an automated name grouping resource.

17. The computer readable medium of claim 15, wherein the automated name grouping resource includes at least one of: a name entity recognizer, an online user-generated-content data resource and a noun phrase model.

18. The computer readable medium of claim 15, wherein the grouping relevance is based on a ranking by probability of the grouping being generated by a unigram model.

19. The computer readable medium of claim 18, wherein the probability is based on a maximum likelihood estimate.

20. The computer readable medium of claim 15, where the method further includes: generating a web corpus overlapping with search results for the search request; and conducting the web resource search on the web corpus.